feat(issue-21): scale up server CCX23 → CCX33 for better UDP uptime #22
Merged
josecelano merged 10 commits into main on Apr 27, 2026
Documents the full resize workflow: pre-resize baseline capture, graceful shutdown, Hetzner panel action by human operator, post-resize recovery and validation, evidence capture, and 7-day observation period. Notes the key behaviour that Hetzner in-place resizes preserve all IP addresses (public, private, and Floating IPs), so no DNS or IP reassignment is needed. Refs: #21
The default conntrack table (262144 entries) fills up under sustained UDP tracker load, causing "nf_conntrack: table full, dropping packet" kernel errors and intermittent UDP timeouts on uptime monitors.

Applied kernel tunables:

- nf_conntrack_max: 262144 → 1048576 (4x increase)
- nf_conntrack_udp_timeout_stream: 120 s → 15 s (8x reduction)
- nf_conntrack_udp_timeout: 30 s → 10 s

Added /etc/modules-load.d/conntrack.conf to pre-load the nf_conntrack module at boot so sysctl settings are applied before Docker starts. Without this, net.netfilter.* keys don't exist when sysctl runs and the settings are silently skipped after a reboot.

Refs: #21
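As a rough illustration of the configuration this commit describes, the sketch below renders the three tunables as sysctl.d-style text plus a one-line modules-load.d file. The helper names and the exact file layout are assumptions made for the example; the real content of `99-conntrack.conf` in the repository may differ.

```python
# Hypothetical helpers: they only render text, they do not touch the system.
# The values come from the commit message; the file layout is an assumption.
CONNTRACK_TUNABLES = {
    "net.netfilter.nf_conntrack_max": 1048576,               # was 262144
    "net.netfilter.nf_conntrack_udp_timeout_stream": 15,     # was 120 s
    "net.netfilter.nf_conntrack_udp_timeout": 10,            # was 30 s
}


def render_sysctl_conf(tunables):
    """Return sysctl.d-style content, one 'key = value' pair per line."""
    lines = [f"{key} = {value}" for key, value in tunables.items()]
    return "\n".join(lines) + "\n"


def render_modules_load_conf(modules=("nf_conntrack",)):
    """Return modules-load.d-style content, one module name per line."""
    return "\n".join(modules) + "\n"


if __name__ == "__main__":
    print("# sysctl.d content")
    print(render_sysctl_conf(CONNTRACK_TUNABLES), end="")
    print("# modules-load.d content")
    print(render_modules_load_conf(), end="")
```

Keeping the module pre-load next to the tunables reflects the dependency the commit message points out: the `net.netfilter.*` keys only exist once `nf_conntrack` is loaded.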
Fill in the D+1 row (2026-04-20) in the daily checks log:

- HTTP: ~1564 req/s, UDP: ~1015 req/s, total ~2579 req/s (~322/vCPU)
- Host load: 6.05/5.49/4.80
- UDP newTrackon uptime: 83.9% (includes resize downtime + conntrack overflow period; fix applied same day)

Update the pre/post comparison table with available metrics and mark the decision as "partial": resize alone was insufficient, conntrack overflow was the actual bottleneck. Follow-up plan added.

Refs: #21
…window

- Fill D+3–D+7 daily log rows; all post-fix days show uptime recovering to 99.9% by D+7 (2026-04-27)
- Add D+7 newTrackon snapshot: both HTTP and UDP trackers at 99.90%
- Add D+7 live conntrack verification: table at 32.6% utilization, no table-full dmesg events, zero IPv4 UDP receive-buffer errors
- Flip decision in 03-pre-post-comparison.md from Partial → Success
- Update main issue doc Current State to 2026-04-27 RESOLVED

Refs: #21
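The utilization figure in the D+7 verification is the ratio of live conntrack entries to the table ceiling. A minimal sketch of that check follows, assuming the standard `/proc/sys/net/netfilter` files are readable; the function names and the default 80% alert threshold are illustrative choices, not the repository's actual tooling.

```python
from pathlib import Path

# Standard procfs location; only present while nf_conntrack is loaded.
PROC_NETFILTER = Path("/proc/sys/net/netfilter")


def conntrack_utilization(base=PROC_NETFILTER):
    """Return current entries divided by the table ceiling (0.0 to 1.0)."""
    base = Path(base)
    count = int((base / "nf_conntrack_count").read_text().strip())
    maximum = int((base / "nf_conntrack_max").read_text().strip())
    return count / maximum


def check(base=PROC_NETFILTER, alert_at=0.80):
    """Return a one-line status string; alert_at is an assumed threshold."""
    ratio = conntrack_utilization(base)
    status = "ALERT" if ratio >= alert_at else "ok"
    return f"{status}: conntrack table at {ratio:.1%}"


if __name__ == "__main__":
    print(check())
```

Passing `base` as a parameter keeps the check testable against a temporary directory instead of the live kernel.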
- add a permanent runbook for diagnosing and fixing UDP conntrack saturation
- add a reusable workspace skill for checking conntrack-related UDP loss
- link infrastructure and issue docs to the new canonical guidance
- add shared spell-check terms for the new runbook and skill

Refs: #21
ACK dcc3cc3
josecelano added a commit to torrust/torrust-website that referenced this pull request on Apr 27, 2026:

…UDP tracker

2762210 docs: add blog post on nf_conntrack overflow with Docker UDP tracker (Jose Celano)

Pull request description:

## Summary

Adds a new blog post documenting the `nf_conntrack` table exhaustion problem that caused UDP tracker downtime on both the DigitalOcean and Hetzner Torrust demos.

## What the post covers

- **Mechanism**: how Docker bridge DNAT forces connection tracking for UDP flows, and why the table fills under tracker load
- **Symptom**: UDP availability drops while HTTP stays healthy, self-recovering outages, application log completely silent
- **Diagnosis**: `dmesg`, `/proc/sys/net/netfilter/nf_conntrack_count`, `conntrack -S`
- **Our experience**: three incidents across two demos (DigitalOcean × 2, Hetzner × 1); post-fix UDP uptime confirmed at 99.9%
- **The fix**: three-parameter sysctl config (`nf_conntrack_max`, `udp_timeout`, `udp_timeout_stream`) + module pre-load for reboot persistence
- **Hash table sizing**: `nf_conntrack_buckets` / `hashsize` to avoid O(n) lookup degradation after raising the ceiling
- **Reboot persistence trap**: why sysctl settings silently vanish after reboot without `modules-load.d`
- **Alternative approaches**: host networking (`--network=host`), `NOTRACK` rules (with real-world failure story from torrust/torrust-demo#72), and macvlan
- **Monitoring**: `conntrack -S` early_drop counter, 80% fill-level alerting rule
- **Independent documentation**: links to the Aquatic tracker Docker guide that covers the same problem

## Related issues

- torrust/torrust-demo#26: first occurrence (DigitalOcean)
- torrust/torrust-demo#72: second occurrence + failed NOTRACK attempt
- torrust/torrust-tracker-demo#21: third occurrence (Hetzner)
- torrust/torrust-tracker-demo#22: PR that deployed the fix

ACKs for top commit:

josecelano: ACK 2762210

Tree-SHA512: 593ac524b72d051b0330ec3a6cd006e155e56ac3aa17ffc03b426936c0c9f5313391f2920f604b55aad29e2bb82e3dea428fd1b1d9dfd691e28e04666b0cf2b2
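The diagnosis and monitoring items above come down to two text checks: looking for the kernel's table-full message in `dmesg` output, and reading drop counters from `conntrack -S`. The sketch below works on already-captured text; the function names are hypothetical, and the `key=value` per-CPU layout assumed for `conntrack -S` can vary between conntrack-tools versions.

```python
import re

# Kernel message quoted in the conntrack commit of this pull request.
TABLE_FULL = "nf_conntrack: table full, dropping packet"


def count_table_full(kernel_log):
    """Count kernel log lines that report a full conntrack table."""
    return sum(1 for line in kernel_log.splitlines() if TABLE_FULL in line)


def sum_counter(stats_output, counter="early_drop"):
    """Sum one key=value counter across all per-CPU lines of stats text."""
    pattern = re.compile(rf"\b{re.escape(counter)}=(\d+)")
    return sum(int(m.group(1)) for m in pattern.finditer(stats_output))


if __name__ == "__main__":
    # Illustrative sample text, not captured from the real server.
    sample_log = "[12.3] nf_conntrack: table full, dropping packet\n"
    sample_stats = "cpu=0 found=0 drop=0 early_drop=4\ncpu=1 found=0 drop=0 early_drop=1\n"
    print("table-full events:", count_table_full(sample_log))
    print("early_drop total:", sum_counter(sample_stats))
```

The word-boundary anchor in the pattern keeps `drop` from also matching inside `early_drop`, so each counter is summed separately.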
Summary
Scales the Hetzner server from CCX23 (4 vCPU, 16 GB RAM) to CCX33 (8 vCPU,
32 GB RAM) to address the UDP uptime issues tracked in #19 and #21. The
observation window is complete. This PR includes the full evidence trail,
the conntrack fix required to sustain uptime, and permanent operational
documentation.
What Happened
The resize alone was not sufficient. A secondary root cause was discovered
during the observation window: Docker's bridge DNAT forces every inbound UDP
flow through connection tracking, so each new client flow adds a conntrack
entry. With the default `nf_conntrack_max=262144` and a 120 s UDP stream
timeout, the conntrack table filled under load and the kernel dropped packets
without any error reaching the application log.
Fix applied (2026-04-20):

- `nf_conntrack_max=1048576` (4× previous)
- `nf_conntrack_udp_timeout=10`
- `nf_conntrack_udp_timeout_stream=15`
- `nf_conntrack` kernel module pre-loaded via `/etc/modules-load.d/conntrack.conf`

After this fix, UDP uptime rose from ~92% to 99.90% and has held there for
the full 7-day post-fix window.
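Because these settings can be silently skipped when the module is not loaded at boot, it is worth confirming the live values after a reboot. The following is a minimal sketch of such a check, assuming the standard procfs paths; `verify_conntrack_settings` is a hypothetical helper, not part of the repository.

```python
from pathlib import Path

# Values from the fix listed above.
EXPECTED = {
    "nf_conntrack_max": 1048576,
    "nf_conntrack_udp_timeout": 10,
    "nf_conntrack_udp_timeout_stream": 15,
}


def verify_conntrack_settings(base="/proc/sys/net/netfilter", expected=EXPECTED):
    """Return a list of problems; an empty list means every value matches."""
    problems = []
    for name, want in expected.items():
        path = Path(base) / name
        if not path.exists():
            # The net.netfilter.* keys only exist once nf_conntrack is loaded.
            problems.append(f"{name}: missing (is the nf_conntrack module loaded?)")
            continue
        got = int(path.read_text().strip())
        if got != want:
            problems.append(f"{name}: expected {want}, found {got}")
    return problems


if __name__ == "__main__":
    issues = verify_conntrack_settings()
    print("\n".join(issues) if issues else "conntrack settings match")
```

A missing key is reported separately from a wrong value, since the two point at different causes: an unloaded module versus a sysctl file that was not applied.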
Outcome
Acceptance Criteria
03-pre-post-comparison.md
Changes
Evidence trail:

- `docs/issues/ISSUE-21-scale-up-server-for-udp-uptime.md`: issue spec, now marked RESOLVED
- `docs/issues/evidence/ISSUE-21/00-pre-resize-baseline.md`: pre-resize Prometheus measurements
- `docs/issues/evidence/ISSUE-21/01-resize-execution.md`: full resize log
- `docs/issues/evidence/ISSUE-21/02-post-resize-daily-checks.md`: 7-day daily log (D+1–D+7 filled)
- `docs/issues/evidence/ISSUE-21/03-pre-post-comparison.md`: pre/post comparison, decision: Success

Server configuration (deployed and in-repo):

- `server/etc/sysctl.d/99-conntrack.conf`: conntrack kernel parameters
- `server/etc/modules-load.d/conntrack.conf`: ensures `nf_conntrack` loads at boot

Permanent operational documentation:

- `docs/udp-conntrack-runbook.md`: how to detect, fix, and validate conntrack saturation and softirq imbalance (including RPS/RFS how-to)
- `.github/skills/check-udp-conntrack/skill.md`: agent workflow for future conntrack health checks

Infrastructure docs updated:

- `docs/infrastructure.md`: updated traffic figures and added runbook link
- `docs/infrastructure-resize-history.md`: new file; resize events log with links

Refs: #21